A Post by Michael B. Spring

The Federation and Balkanization of Information (June 18, 2009)

There is little doubt that we are learning new ways to use information to transform economic and social enterprises. In e-business, one of the key concepts I teach is the notion of replacing inventory with information. The lecture is long and detailed, but let it suffice here to say that it is pretty easy to see that inventory represents an investment of money and that it costs money (storage space, pilferage, etc.) to store it. If we have perfect information about our needs, we can manage inventory on demand. Thus, we replace expensive inventory with cheap information. In similar ways, the use of the internet is having a dramatic impact on our social system, from politics to health care. In the midst of all this, we all have a clear sense of what information is, but good scientific definitions elude us. I believe it is important that we have a better sense of how to measure information, how to determine its worth, and how it flows and is transformed, aggregated, and balkanized. Information balkanization is only one manifestation of the effort to control and manage information. In a sense, the balkanization of information has served to some extent to protect information about me, but there are signs of information federation that allows partners to share information. I fear that costly balkanization will soon give way to revenue-generating federation.

Definitions of Information

There are a variety of ways to define information metrics. It is theoretically appealing to fall back on Shannon’s definition of information as a measure of the entropy in a signal. Unfortunately, this definition does little to inform economic or social policy. While information can be measured objectively in terms of entropy, it is the impact it has on people and systems that may be more critical. The common sense social definition – i.e. information is that which I don’t know already – is interesting in that it makes every measure of information relative. (I suspect you already knew that.) We might try to be more formal and say that data that causes a change in the state of the receiving system is information. We might try to build a more complete model and suggest that information encoded into a system constitutes knowledge. This approach is appealing in that it links the definitions of data, information, and knowledge. It is unappealing in that it lacks a quantitative metric, and even if one existed, it would be heavily influenced by the individual receiving the data. (What is information to you may not be information to me. At the same time, we might both share the same knowledge.)
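To make the entropy measure concrete, here is a minimal sketch in Python. It assumes that symbol probabilities are estimated from the character frequencies of the message itself, and the function name shannon_entropy is my own label for illustration, not a standard library call.

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Average information per symbol, in bits.

    Assumption: probabilities are estimated from the message's own
    character frequencies (an empirical, zero-order estimate).
    """
    if not message:
        return 0.0
    counts = Counter(message)
    total = len(message)
    # Sum of p * log2(1/p) over every distinct symbol.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    for text in ("aaaa", "abab", "abcd"):
        print(text, shannon_entropy(text))
```

Note that the result depends only on the symbols, never on the reader. A message scores the same whether or not the receiver already knew its content, which is exactly why this measure says so little about the social value of information.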

Public versus private information

One of the interesting problems we face today is the aggregation of information about us on the web. Much of this information is balkanized, and some of it is more federated than we know. What is more interesting is that the information about us includes both public information that we share (a photograph) and private information we may volunteer in the anticipation that it will be kept confidential (a credit card number or our birthdate). But there is also information that can be gathered from our clicks that reveals things about us that we might not know ourselves. This brings to mind the “Johari Window” proposed by Joseph Luft and Harry Ingham in the 1950s as a conceptual model which was used in counseling and self-help groups. They suggest, in a narrow context, that there are four categories of information about an individual:

Johari Window		Known to self	Not known to self
Known to others		open		blind
Not known to others	hidden		unknown

It is surely the case that all four types of information exist on the web. We have discussed three of them already. The fourth, unknown, is simply waiting for the right data mining technique.

Ownership and shared information

This leads one to interesting notions of the ownership of information and the location of information. Perhaps the most interesting information in this category is medical information. Consider an individual’s medical record. Who owns this record? Is it the physician who created it? Is it the patient it is about? The matter may be further complicated by considering the components of a medical record. Who owns the following:

An x-ray of my lung
A reading of the x-ray
A diagnosis based on the reading

I would like to believe that my medical record, my employee record, and my education record are all my property, but it is not clear that this is the case. The same is true of my email. I don’t believe any mail system suggests that the author owns the content. I know at my University it is the University, not me, who legally owns it. The challenges to the perception that it is my personal information are few and far between and generally meet acceptable social criteria for the intrusion, but they are there.

“Order” of information

Some information is primary – for example, one might consider the statement that it rained in Pittsburgh primary. This could be called first order information. Collecting these pieces of information for a year, one might derive a piece of second order information such as 2003 saw rain on 30% of the days in Pittsburgh. Collecting this information for several years, one could derive a piece of third order information – over 20 years, the average number of rainy days in Pittsburgh is decreasing. Information might also be categorized in terms of how it is derived, and this might impact other properties – such as ownership. For example, the fact that a person has a certain height or weight might be defined as first order information. The representation of that information as a certain number of inches or a certain number of pounds might be declared second order information. The fact that a person is “overweight” is obtained by a function of height and weight in accord with some algorithm. Is the information that a person is “overweight” their information or does it belong to the person who applied the algorithm?
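The progression from first to third order information can be sketched in a few lines of Python. Everything here is an illustrative assumption: the daily rain observations are made-up toy data (ten days per year, not real Pittsburgh weather), the function names are mine, and the “overweight” rule uses the common body mass index cutoff of 25 as a stand-in for “some algorithm”.

```python
from statistics import mean

# First order: raw observations. Hypothetical toy data, NOT real weather.
daily_rain = {
    2001: [True, False, True, False, False, True, False, False, True, False],
    2002: [True, False, False, False, False, True, False, False, True, False],
    2003: [False, False, False, False, False, True, False, False, True, False],
}


def rainy_fraction(days):
    """Second order: summarize one year of first-order observations."""
    return sum(days) / len(days)


def yearly_trend(fractions_by_year):
    """Third order: least-squares slope of the yearly fractions."""
    years = sorted(fractions_by_year)
    values = [fractions_by_year[y] for y in years]
    x_bar, y_bar = mean(years), mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, values))
    denominator = sum((x - x_bar) ** 2 for x in years)
    return numerator / denominator


def is_overweight(height_in, weight_lb):
    """Derived label from two measurements (assumed rule: BMI >= 25)."""
    height_m = height_in * 0.0254
    weight_kg = weight_lb * 0.45359237
    return weight_kg / height_m ** 2 >= 25


second_order = {year: rainy_fraction(days) for year, days in daily_rain.items()}
third_order = yearly_trend(second_order)

if __name__ == "__main__":
    print(second_order)
    print("decreasing" if third_order < 0 else "not decreasing")
    print(is_overweight(70, 200))
```

The sketch makes the ownership question sharper. The inputs to is_overweight describe the person, but the cutoff and the formula belong to whoever wrote the function, and the label exists only once the two are combined.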

High grade versus low grade

Some information that describes me explicitly – my height, my medical condition – might be considered high-grade information about me. Low grade information might include:

Which books I look at on amazon.com
When I am logged onto the internet
What Microsoft applications I use, or what I ask for help about

Do I have the same right of ownership over high grade and low grade information? Indeed, how is the ownership established? When low grade information is anonymous and aggregated, who owns it? Generally, I do not control this kind of information about me. When is information explicitly or implicitly transferred? As with the shared ownership issue, this category again asks who actually owns what portion of the information.

Operations on Information

We lack formalizations for operations on information. While we know how to copy or delete artifacts that contain information, it is not clear how we determine whether two pieces of information are identical. How do we “subtract” or “add” two pieces of information? How do we measure the change in information after a transformation? How do we transform information from first to second order?

From an end-user perspective, the operational problems with information include such things as understanding the ownership of information, dissemination of information about individuals, tracking of information flows, etc. A variety of disciplines may contribute to more operational and rigorous definitions of information.

We might be concerned about keeping private information private. How do I secure information about me? How do I control information I divulge to others? How do I ensure it is not reused without my permission? Is there a difference between selling my email address to others, telling others I am looking for a mortgage, or telling others I work at a University?

What is the role of public-private key encryption in controlling access to “jointly-owned” information? For example, a physician can keep information about me, but it may only be released when accompanied by both my and the physician’s private keys. How does this impact medical research? How does this impact emergency care? What happens to my information when I cease to exist? How does my privacy impact the public right to be safe?
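The idea that a record can only be released when both parties cooperate can be sketched without any public-key machinery. The example below is a deliberately simpler stand-in, not the public-private key scheme described above: it splits the record into two XOR shares, one for the patient and one for the physician, so that neither share alone reveals anything. The function names and the sample record are hypothetical.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("shares must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def split_record(record: bytes):
    """Split a record into two shares (2-of-2 XOR secret splitting).

    The patient's share is uniformly random; the physician's share is
    the record XORed with it. Either share alone is just random noise.
    """
    patient_share = secrets.token_bytes(len(record))
    physician_share = xor_bytes(record, patient_share)
    return patient_share, physician_share


def release(patient_share: bytes, physician_share: bytes) -> bytes:
    """Recover the record; this requires both shares."""
    return xor_bytes(patient_share, physician_share)


if __name__ == "__main__":
    record = b"x-ray reading: no abnormality"  # hypothetical record
    mine, doctors = split_record(record)
    print(release(mine, doctors) == record)
```

Even this toy version surfaces the policy questions. If I am unconscious in an emergency room, or no longer exist, my share is unavailable and the record cannot be released at all unless someone else holds a copy of it.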

Can information be doped? Explosives are tagged or doped with trace chemicals that allow the origin of the explosive to be traced in the event that it is used in a crime. Like some compression, the doping process is asymmetric. It is low cost to dope an explosive and high cost to trace the tags. As with compression, the basic idea is to keep the cost of the most frequent operation low and allow the cost of the less frequent operation to grow. Thus, when a file is infrequently compressed and very frequently decompressed, the cost of the compression can be high while the cost of decompression is kept low. How might doping or watermarking be used to trace illicit use or dissemination of information?
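One very simple way to dope a piece of text is to hide a recipient number in invisible characters, so that a leaked copy points back to whoever received it. The sketch below is a toy under stated assumptions: it encodes a 16-bit recipient ID with two zero-width Unicode characters, and the function names, the ID width, and the sample text are all my own choices for illustration.

```python
from typing import Optional

ZERO = "\u200b"  # zero-width space, encodes bit 0
ONE = "\u200c"   # zero-width non-joiner, encodes bit 1


def dope(text: str, recipient_id: int, bits: int = 16) -> str:
    """Hide recipient_id in the text as a run of invisible characters."""
    if not 0 <= recipient_id < 2 ** bits:
        raise ValueError("recipient_id does not fit in the tag")
    tag = "".join(
        ONE if (recipient_id >> i) & 1 else ZERO
        for i in reversed(range(bits))  # most significant bit first
    )
    # Place the tag after the first character so it travels with the text.
    return text[:1] + tag + text[1:]


def trace(text: str) -> Optional[int]:
    """Read the hidden ID back, or return None if the text is untagged."""
    marks = [c for c in text if c in (ZERO, ONE)]
    if not marks:
        return None
    value = 0
    for c in marks:
        value = (value << 1) | (c == ONE)
    return value


def visible(text: str) -> str:
    """The text as a reader sees it, with the tag characters removed."""
    return "".join(c for c in text if c not in (ZERO, ONE))


if __name__ == "__main__":
    copy_for_bob = dope("Quarterly report", 1234)
    print(visible(copy_for_bob), trace(copy_for_bob))
```

This toy does not reproduce the cost asymmetry of doped explosives, since reading the tag is as cheap as writing it, and anyone who knows the trick can strip the tag. A serious scheme would need marks that survive copying, retyping, and deliberate removal.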